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ABSTRACT It has been claimed recently that it may be possible to predict the rate of de novo mutation of each KEYWORDS 

site in the human genome with a high degree of accuracy [Michaelson et a/. (2012), Cell 151 : 1431 — 1442]. We de novo mutation 

show that this claim is unwarranted. By considering the correlation between the rate of de novo mutation and human mutation 

the predictions from the model of Michaelson et a/., we show there could be substantial unexplained variance in cryptic variation 

the mutation rate. We investigate whether the model of Michaelson et al. captures variation at the single mutation rate 
nucleotide level that is not due to simple context. We show that the model captures a substantial fraction of variation 

this variation at CpG dinucleotides but fails to explain much of the variation at non-CpG sites. mutation rate 



It has been known for some time, from comparative studies, that the 
mutation rate varies at a number of different scales along the human 
genome, from variation between individual nucleotides, to differences 
between whole chromosomes (Hodgkinson and Eyre- Walker 2011). 
Much of this variation has remained unexplained (Hodgkinson and 
Eyre- Walker 2011). However, Michaelson et al. (2012) recently have 
claimed that the rate of mutation at each site is highly predictable. 
They use principle component logistic regression fitted to a dataset of 
653 de novo mutations (DNMs) to estimate a model from which they 
can predict the mutability index (MI), a measure of the mutation rate, 
of each site in the human genome. To assess the fit of the model, they 
count the number of sites in the genome with a particular MI (n) and 
the number of DNMs at those sites (d). They therefore have a pre- 
diction of the mutation rate from their model, the MI, and the ob- 
served rate of mutation, z=d/n. They find a very strong correlation 
between the logarithm of z and MI and infer that 

MI was highly predictive of site-specific (1 bp resolution) 
mutation rates . . . and could explain 90% of the variability in 

mutation rates at sites across the genome We conclude that 

our statistical model of mutability can explain a majority of the 
variance in site-specific mutation rates. 
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However, for each MI value they have thousands to millions of sites. 
As a consequence, any variation that their model does not explain will 
be averaged out when they consider the observed mmiber of mutations. 
This can be illustrated as follows. Consider sites with an MI such that 
their mutation rate is 10~^, approximately the mean mutation rate in 
himians (1000 Genomes Project Consortium 2010; AwadaUa et al. 2010; 
Conrad et al 2011). If the model of Michaelson et al. (2012) explains all 
the variation in the mutation rate then all sites with this MI will have 
a mutation rate of 10~^. However, if there is imexplained variance, the 
mutation rate of each site will deviate from this value. Let us assimie that 
equal numbers of sites with this MI actually have mutation rates of 0.1 x 
10"^ and 1.9 x 10~^; the mean mutation rate for sites with this MI is 
still 10"^. It is clear that if we only sample a few sites, then the observed 
mean mutation rate often will deviate substantially from the average 
value of 10~^, and the correlation between the log of the observed 
number of DNMs and the MI will be correspondingly weak if we have 
similar levels of variation at sites with other MI values. However, as we 
sample more and more sites, the mean value will approach the expected 
value of 10~^ and the correlation between the log of the mmiber of 
DNMs per site and MI will become better [assuming that the model of 
Michaelson et al (2012) explains at least some of the variance]. Because 
there are typically thousands if not millions of sites for each MI value, 
any unexplained variance will be averaged away. 

Here we examine how much variation there could be that is 
unexplained by the model of Michaelson et al (2012). We also test 
whether their model explains a source of variation in the mutation 
rate that is both strong and likely to be difficult to explain - cryptic 
variation in the mutation rate. This is variation that exists at the single 
nucleotide level but which appears to be independent of the context of 
the adjacent nucleotides. Cryptic variation was first discovered in 
mammalian nuclear DNA by noting that there is an excess of 
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single-nucleotide polymorphisms (SNPs) that occur at the same site in 
humans and chimpanzees, even when the influence of the adjacent 
nucleotides on the mutation rate is taken into account (Hodgkinson 
et al. 2009; see also Leffler et al. 2013). A similar excess has been noted 
for single-nucleotide substitutions in pairs of independent primate 
species (Johnson and Helknann 2011). This excess of SNPs does 
not appear to be due to ancestral polymorphism, sequence assembly 
errors, or balanced polymorphism because the allele frequencies at 
sites with SNPs in both humans and chimpanzees are no different 
to other polymorphic sites (Johnson and Helknann 2011; see also 
additional discussion in Johnson and HeUmann 2011 and at http:// 
www.plosbiology.org/annotation/listThread.action?root=21243). Esti- 
mates suggest there might be as much variation in the mutation rate 
that is not associated with context as there is associated with context 
(Hodgkinson et al. 2009). However, predicting whether a site is cryp- 
tically hyper- or hypomutable is likely to be difficult, because the 
increase or decrease in the mutation rate is extremely precise — it 
affects one nucleotide and not the adjoining sites — and it does not 
depend on the adjacent nucleotides. 

MATERIALS AND METHODS 
Data 

We downloaded details of DNMs from the supplementary informa- 
tion of several publications (Conrad et al. 2011; lossifov et al. 2012; 
Kong et al. 2012; Michaelson et al. 2012; Neale et al 2012; O'Roak 
et al. 2011; Sanders et al 2012). If necessary these were lifted over to 
hgl8, the human genome assembly used by Michaelson et al. (2012). 

Simulating data 

We simulated data under the model of Michaelson et al (2012) as 
follows. First, for a dataset of DNMs we regressed, using weighted re- 
gression, the log of the observed number of DNMs per site, z, against 
MI, to yield the relationship between the mutation rate and MI under 
the Michaelson model. Because there are a limited number of DNMs for 
some MI values, we binned the MI values into groups of 10 and re- 
moved those bins that had 5 or fewer DNMs. Using the regression 
equation, and the number of sites, we predicted the expected number 
of mutations at sites with an MI of x, Z(x). To generate data under the 
assumption that the Michaelson model explains all the variance in the 
mutation rate, we sampled from a Poisson distribution with expected 
values Z(x). To investigate the effect of variance unexplained by the 
Michaelson model, we added an additional step to the simulation. Hav- 
ing used the regression model (of log(DNMs per site) vs. MI) to predict 
the expected number of mutations for a site with an MI of x, Z(x), we 
multiplied this by a random variate drawn from a lognormal distribu- 
tion with variance = T/«, where n is the number of sites and T is the 
variance in the mutation rate at each site unexplained by the Michaelson 
model. This yielded the expected number of mutations for sites with an 
MI of x; to generate the observed number we sampled from a Poisson 
distribution. The logic is as follows; the mean mutation rate for sites with 
an MI of x is Z(x), but the rate of a particular site is Z(x)a where a is 
a random variate that is lognormally distributed. Because the mean of n 
lognormaUy distributed variates, each with a variance T, is itself approx- 
imately lognormal with a variance equal to T/n (Beaulieu et al 1995; 
Fenton 1960), we can simulate the effect of unexplained variation among 
sites with an MI of x by multiplying the expected mutation rate by 
a random lognormal variate with variance T/n. We generated 1000 
simulated datasets and calculated the correlation between MI and the 
log of the simulated number of mutations per site. Occasionally the 
simulation would generate no DNMs for an MI value; we removed 



these datasets. We then compared the correlation between the log of 
the observed ntmiber of DNMs and MI against the correlation between 
the log of the simulated number of DNMs and MI. To take into account 
the uncertainty in the relationship between the log of the observed 
mutation rate and MI, we bootstrapped the data before performing 
the regression by resampling the data points from the regression. 

Coincident SNP analysis 

We investigated the difference in the mutation rate between sites with 
and without a coincident SNP (a site with a SNP in both humans and 
chimpanzees) as follows. We assume that the distribution of mutation 
rates is a gamma distribution arbitrarily scaled such that the mean of 
the distribution is one; it is therefore characterized solely by its shape 
parameter. We also assume that hypermutable sites destroy them- 
selves when they mutate. This assumption makes little difference to 
the non-CpG analysis but reduces the level of variation needed to 
explain the coincident SNPs in the CpG analysis. Hodgkinson et al. 
(2009) have shown that under this model the probability of observing 
a coincident SNP at a site is 

P = u^u, jD{y) {e-^yy^ + (l - e-^^y))dy (1) 

where ui, and are the density of SNPs in the two species being 
considered, v is the average divergence between the species and D(y) 
is the distribution of the rates. Therefore the average mutation rate 
of sites with coincident SNPs, relative to the average mutation rate 
(arbitrarily set to one) is 

J{D{y)e-^yy^ + {l-e-^^y))ydy 
^ J{D{y)e-^yy^ + {l-e-2-y))dy ^' 

In our calculations, we assume that the divergence at non-CpG sites 
between human and chimpanzee sites is 0.0092 (Chimpanzee- 
Sequencing-and-Analysis-Consortium 2005) with the divergence at 
CpG sites lOx higher at 0.092 (Chimpanzee-Sequencing-and-Analysis- 
Consortium 2005; Hwang and Green 2004). 

We used the coincident SNPs found in humans and chimpanzees 
discovered by Leffler et al. (2013). We did not exclude coincident 
SNPs that they had inferred to be subject to balancing selection be- 
cause there were <300 of these of a total of 34,000 coincident SNPs 

RESULTS AND DISCUSSION 

Michaelson et al. (2012) assessed the fit of their model by comparing the 
mutation rate predicted by their model to the observed number of DNMs. 
In doing so, they averaged the mutation rate across thousands if not 
millions of sites, potentially hiding unexplained variance in the mutation 
rate. To investigate how much unexplained variance there might be, 
we simulated data under their model (henceforth referred to as the 
Michaelson model) with and without additional variance. In the sim- 
ulation we estimated the relationship between MI {i.e., the Michaelson 
model) and the rate of mutation using sets of DNMs. We then used 
this relationship to predict the expected number of mutations at a site 
and then simulated data based on these expectations with and without 
additional variation in the mutation rate (details in the section Materials 
and Methods). We performed the analysis for three sets of DNMs: 
(i) the 652 DNMs reported by Michaelson et al. (2012) and used to 
build the model upon which the MI values are based (referred to as 
the Michaelson data), (ii) 1380 DNMs reported by various other studies 
(Conrad et al. 2011; lossifov et al. 2012; Neale et al 2012; O'Roak et al. 
2011; Sanders et al 2012) (Otiier data), and (iii) 4933 DNMs reported 
by Kong et al (2012) (Kong data; note that only DNMs with an MI 
value were included). 
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As previously shown by Michaelson et al. (2012), the correlation 
between the log of the number of DNMs per site and the MI value is 
very strong for the Michaelson data (r = 0.98, P < 0.001; Figure lA); 
this finding is perhaps not surprising, given that these were the data 
used to construct the Michaelson model and the model is relatively 
parameter rich. However, as Michaelson et al. (2012) showed, their 
model also fits the data from other studies well (r = 0.97, P < 0.001; 
Figure IB), although there is a clear nonlinearity in the relationship (a 
quadratic term in a nonlinear regression is significant P = 0.010). 
However, the fit of the Michaelson model to the Kong data, which 
Michaelson et al. (2012) did not study, is relatively poor (r = 0.94, 
P < 0.001; Figure IC). The problem would seem to He with the Kong 
data, since the model fits the other two datasets well. The slope of the 
regression line from the Kong data [0.0047 (0.0006)] is significantly less 
than that observed for the Michaelson et al. (2012) [0.010 (0.0007)] and 
Other datasets [0.0084 (0.0007)], suggesting that there has been sys- 
tematic underreporting of DNMs from the more mutable areas of the 
genome in the Kong et al. (2012) dataset (or alternatively, that there are 
large numbers of false positives in the less mutable parts). 

If we assume that the Michaelson model explains aU the variation 
in the mutation rate, we find that simulated datasets have similar 
levels of correlation between the log of the number of DNMs per site 
and MI to that observed in the real data for the Michaelson and other 
datasets, i.e., any variance not explained by the Michaelson model can 
be attributed to sampling error due to the fact that there are limited 
number of DNMs (Table 1). AU the simulated correlations are stron- 
ger than the observed correlation in the Kong data, but this is prob- 
ably because the Michaelson model fits these data poorly. 

However, despite the good fit between model and data for two 
of the datasets, we find that there could be very substantial levels of 
unexplained variance and the correlations would remain almost 
unaffected (Table 1). Only when the variance associated with the 
unexplained variance approaches lO"" do we see the correlations being 
affected and approaching the values seen in the real data. This level of 
variance dwarfs that explained by the Michaelson model; the coeffi- 
cient of variation in the mutation rate explained by the Michaelson 
model is 1.10, the coefficient of variation for the unexplained variation 
is 300 if the variance is 10^. This analysis therefore shows that there 
could be a substantial amount of unexplained variance that would 
never be detected, assessing model fit as Michaelson et al. (2012) have 
done. However, it does not show that this unexplained variance exists. 

Assessing model fit is not easy with these datasets; there are very few 
DNMs spread across millions of sites. We therefore sought to test one 
component of mutation rate variation that is both substantial and likely to 
be diflicult to predict, the so-called cryptic variation in the mutation rate 
(Hodgkinson et al 2009; Johnson and Hellmann 2011). This is variation 
at the single-nucleotide level that is independent of adjacent nucleotides. 

To investigate whether the model of Michaelson et al. (2012) 
captures cryptic variation in the mutation rate we proceeded as fol- 
lows. Sites with SNPs in both humans and chimpanzees are inferred to 
have greater mutation rates than sites without coincident SNPs. Using 
some simple theory, we can estimate how much more mutable these 
sites need to be on average to yield the observed excess of coincident 
SNPs, and we can then compare this with the difference in mutation 
rate predicted by the Michaelson model for sites with and without 
coincident SNPs. If the Michaelson model captures cryptic variation 
in the mutation rate then it should correctly predict the level of 
variation needed to explain the observed excess of coincident SNPs. 

Although coincident SNPs and DNMs are measuring variation in 
the mutation rate over very different time scales, it is expected that any 
source of variation identifiable through coincident SNPs should also 
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Figure 1 The log of the number of de novo mutations (DNMs) per site 
vs. the mutability index (Ml). The mutability index for the (A) Michael- 
son, (B) Other, and (C) Kong datasets. 



be present in DNMs. In fact if variation in the mutation rate evolves 
through time, we would expect the variation associated with co- 
incident SNPs to be less than that present in DNMs. 

Leffler et al. (2013) have shown, using a carefully curated dataset of 
human and chimpanzee SNPs, that there is a 16% excess of coincident 
SNPs at CpG sites (95% confidence intervals of 14% and 17%) and 
a 83% (80%, 86%) excess at non-CpG sites between human and 
chimpanzee, when correcting for the nucleotides at the adjacent sites. 
Assuming that rates are gamma distributed across sites, we estimate, 
using the theory set out by Hodgkinson et al. (2009), that sites with 
coincident SNPs are 1.40-fold (95% confidence intervals of 1.35, 1.43) 
and 2.71-fold (2.65, 2.77) more mutable on average than the genomic 
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■ Table 1 The proportion of simulated datasets with a greater 
correlation between the log of the number of DNMs per site and 
Ml than observed in the actual data 

V Michaelson eta/. (2012) Other Kong et a/. (2012) 
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average for CpG and non-CpG sites respectively. How do these values 
compare to those under the Michaelson model? Under the Michaelson 
model, we find that sites with coincident SNPs have significantly greater 
MI values at both CpG (mean MI for coincident sites = 91.6, non- 
coincident sites = 81.4; P < 0.001) and non-CpG sites (coincident sites = 
— 7.77, noncoincident sites = —16.0, P < 0.001). However, the differ- 
ences in MI are small and equate to minor differences in the mutation 
rate predicted using the regression model from the data of Michaelson 
et al. (2012); coincident SNPs are predicted to be 1.21-fold (1.21, 1.21) 
more mutable at CpG and 1.17-fold (1.17, 1.17) more mutable at non- 
CpG sites. Thus, our analysis suggests that the Michaelson model may 
capture a substantial fraction, although by no means all, of the variation 
at CpG sites; the level of variation required to explain the excess of 
coincident SNPs at CpG sites is such that we would expect sites with 
coincident SNPs to be 1.40-fold (1.35, 1.43) more mutable than non- 
coincident sites and the Michaelson model predicts them to 1.21-fold 
(1.21, 1.21) more mutable; these estimates are significantly different (P < 
0.001). However, the Michaelson model seems to fail to capture most of 
the variation at non-CpG sites; sites with coincident SNPs are expected 
to be 2.71 -fold (2.65, 2.77) more mutable than noncoincident sites, but 
the Michaelson model predicts them to be only 1.17-fold (1.17, 1.17) 
more mutable. These estimates are significantiy different (P < 0.001). 

Although the Michaelson model appears to explain a substantial 
fraction of the variation at CpG sites this might be deceptive because 
the level of cryptic variation may have been tmderestimated by 
analyzing coincident SNPs. Cryptic variation is, by its definition, 
variation that does not depend upon the context of the immediately 
adjacent nucleotides. However, it is possible, in fact likely, that cryptic 
variation depends upon context in some maimer, may be through 
a dispersed context (for example, if the second nucleotide upstream is 
an A and fifth nucleotide downstream is a C, etc). If this is the case 
then as the context evolves so the mutation rate at the focal site wiU 
change. This wiU lead to a reduction in the correlation in the mutation 
rate between sites in humans and chimpanzees and hence to a lower 
level of variation inferred from coincident SNPs. However, one might 
argue that the level of variation is overestimated by a consideration of 
coincident SNPs, because SNPs can be subject to natural selection, 
which will tend to increase the number of coincident SNPs by 
excluding SNPs from certain parts of the genome. This seems 
unlikely because less than 10% of the human genome is inferred to 
be subject to selection (Eory et al. 2010; Lunter et al. 2006). 

Currently we do not have a robust estimate of how much of the 
total variance in the mutation rate is contributed by cryptic variation. 
Initial rough estimates suggest that there is approximately as much 



variation associated with cryptic variation as there is with simple 
context {i.e., CpG type effects) (Hodgkinson et al. 2009). Because 
variation at the single-nucleotide level appears to contribute the most 
to variation in the mutation rate (Hodgkinson and Eyre- Walker 201 1) 
it would seem that the Michaelson model is missing a major compo- 
nent of the variation, particularly at non-CpG sites. 
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